Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Helm Chart for OpenVINO vLLM #403

Open
wants to merge 45 commits into
base: main
Choose a base branch
from

Conversation

krish918
Copy link

@krish918 krish918 commented Sep 5, 2024

Description

This PR introduces OpenVINO vLLM as a new inference generation microservice. It integrates with ChatQnA application, enhancing its capabilities to utilize either of TGI or OpenVINO vLLM as underlying inference engine. The decision to use TGI or OpenVINO vLLM can be made while installing the helm chart itself. This is achieved by turning off a flag corresponding to TGI and turning on a flag corresponding to vLLM service.

Example:

helm install chatqna chatqna --set tgi.enabled=false --set vllm.enabled=true

In above command, tgi.enabled flag is turned off which is honored by chart to avoid deploying TGI microservice. At the same time vllm.enabled flag is switched to true, which enables deployment of vLLM service.

To use the OpenVINO optimized vLLM, a separate values file is present which can be used to deploy vLLM OpenVINO microservice. When using OpenVINO values file, switching flags for tgi or vllm is not required as they are done inside the vllm-openvino-values.yaml file only.

Example:

helm -f ./chatqna/vllm-openvino-values.yaml install chatqna chatqna 

Issues

No known issues yet.

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)

Dependencies

Following two docker images are required as new dependencies:

  • vllm:openvino : If not available, docker image can be built from official vLLM repo.
  • opea/llm-vllm:latest : This image is available publicly on dockerhub.

Tests

Two test pods for both the new microservices are part of newly added helm charts. These pods constituting basic curl tests for the testing connection and proper response from both the microservices. After spinning up these services, we can run following commands with appropriate values for placeholders to test the sanity of services:

helm install <release-name> <chart-name>
helm test <release-name>

Copy link
Collaborator

@lianhao lianhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing this. Seems it needs some fix to pass the CI. Please also see my comment.

Since @yongfengdu is also working on vllm related helm-chart, I'll let him comment more on this.

object:
metric:
# VLLM time metrics are in seconds
name: vllm_ov_request_latency
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does this metric comes from?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this resource as this metric was not setup. Anyway, this vllm-openvino resources will not be used and will be replaced by OpenVINO specific values file.

labels:
{{- include "vllm-openvino.labels" . | nindent 4 }}
data:
MODEL_ID: {{ .Values.global.LLM_MODEL_ID | quote }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think using global.LLM_MODEL_ID is a good idea. The charts in common directory is used as components to construct the e2e AI workload, so it's possible that the same vllm-openvino chart could be used twice in the e2e helm charts(e.g. chatqna, etc.) with 2 different models. So each vllm-openvino chart should have its own LLM_MODEL_ID settings.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. I've updated this in new commits.

PORT: {{ .Values.service.port | quote }}
HF_TOKEN: {{ .Values.global.HUGGINGFACEHUB_API_TOKEN | quote}}
VLLM_CPU_KVCACHE_SPACE: {{ .Values.VLLM_CPU_KVCACHE_SPACE | quote }}
HABANA_VISIBLE_DEVICES : {{ .Values.HABANA_VISIBLE_DEVICES | quote }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is using HABANA_VISIBLE_DEVICES? On K8S, it's the habana k8s-device-plugin to allocate the habana device into the container.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this variable after confirming not being used.

no_proxy: {{ .Values.global.no_proxy | quote }}
HABANA_LOGS: "/tmp/habana_logs"
NUMBA_CACHE_DIR: "/tmp"
TRANSFORMERS_CACHE: "/tmp/transformers_cache"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use TRANSFORMERS_CACHE any more for tgi, but does this required for openvino. For environment in configmap, please make sure that everything you set here is actually play a role for actual application in the k8s pod

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this deprecated variable.

replicaCount: 1

image:
repository: vllm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this image comes from? Could you please provide the link to this image on docker hub?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This image is yet not available on Dockerhub. The URL is dockerfile is added in description: vLLM OpenVINO Dockerfile.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we should defer this until the container image is available, otherwise, people can not use the helm chart directly.

TRANSFORMERS_CACHE: "/tmp/transformers_cache"
HF_HOME: "/tmp/.cache/huggingface"
{{- if .Values.MAX_INPUT_LENGTH }}
MAX_INPUT_LENGTH: {{ .Values.MAX_INPUT_LENGTH | quote }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on openvino documentation about environment variable, I don't think it recognize this MAX_INPUT_LENGTH env

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this variable.

MAX_INPUT_LENGTH: {{ .Values.MAX_INPUT_LENGTH | quote }}
{{- end }}
{{- if .Values.MAX_TOTAL_TOKENS }}
MAX_TOTAL_TOKENS: {{ .Values.MAX_TOTAL_TOKENS | quote }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto as MAX_INPUT_LENGTH

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Copy link
Collaborator

@yongfengdu yongfengdu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the backend components, I just integrated the vllm helm charts one or two days ago, I think it should be able to support the vllm-openvino case with just a replacement of docker image(In this case, we can provide a vllm-openvino.yaml to specify different values there). Could you check this? (https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/common/vllm)

For the microservice llm-vllm, since it's the same layer with current existing llm-uservice and providing same function, I am looking to see if that's possible to use just one. Even one further step, to use just one llm docker image to support both tgi and vllm backend (Now we have 2 images: llm-tgi and llm-vllm). This may need the llm.py changes before we can merge, so I'm ok now with one more llm-vllm-uservice.

It would be great to support more cases/scenarios with less helm charts to reduce the maintenance efforts.

@@ -9,9 +9,27 @@ dependencies:
- name: tgi
version: 0.9.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest version is updated to 1.0.0, for all components.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. updated all the versions.


Helm chart for deploying a microservice which facilitates connections and handles responses from OpenVINO vLLM microservice.

`llm-vllm-uservice` depends on OpenVINO vLLM. You should properly set `vLLM_ENDPOINT` as the HOST URI of vLLM microservice. If not set, it will consider the default value : `http://<helm-release-name>-vllm-openvino:80`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember hit a limitation of 15 characters for the Chart name, maybe this is no longer a limitation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was able to work with longer names, so doesn't seem like a limitation any more.

@@ -0,0 +1,68 @@
# OpenVINO vLLM

Helm chart for deploying OpenVINO optimized vLLM Inference service.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned earlier, please try using the vllm helm charts with vllm-openvino.yaml values.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a separate values file for OpenVINO.

@lianhao
Copy link
Collaborator

lianhao commented Sep 11, 2024

@krish918 Can you confirm that the vllm container image used by `vllm-openvino' chart is publicly available? Otherwise I don't think it's the right time to add a helm chart which is can NOT be installed. We should either ask the vllm upstream to release their container image, or our OPEA community to release the corresponding container image. K8S is not like docker-compose, we normally don't control the node where the vllm-openvino will be run, so we can't do the manual container image build process like the docker-compose usage scenario, unless we login into all the k8s cluster nodes and build the container image on every node.

@yongfengdu
Copy link
Collaborator

@krish918 Please refer to these 2 commit for my suggestions of reusing vllm and llm-uservice helm charts for openvino support.
yongfengdu@b991659
yongfengdu@4cdb585

Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
@krish918
Copy link
Author

@lianhao @yongfengdu @chensuyue opea/vllm-openvino Image is now available and accessible. All the CI checks are passing. Also, a separate openvino-values file are added to chatqna as well as to vllm subchart for independent deployment. Please have a look.

Copy link
Collaborator

@yongfengdu yongfengdu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to #473
The CI was modified to test only with ci-*values.yaml files.
By default the new vllm-openvino-values.yaml will not be tested. It's up to you to decide whether to enable vllm-openvino test or not. If you want to enable it, add a link with "ln -s vllm-openvino-values.yaml ci-vllm-openvino-values.yaml"

For llm-ctrl-uservice, We have plan to merge this with llm-uservice(And also merge tei/teirerank), but I'm OK to add this for this time.

- name: vllm
version: 1.0.0
repository: file://../vllm
condition: autodependency.enabled
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

autodependency.enabled is no longer used, use vllm.enabled instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

export https_proxy=<your_https_proxy>

helm dependency update
helm install llm-ctrl-uservice . --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set LLM_MODEL_ID=${MODELNAME} --set vllm.LLM_MODEL_ID=${MODELNAME} --set autodependency.enabled=true --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vllm.enabled

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

autodependency:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/autodependency/vllm/

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@krish918
Copy link
Author

@yongfengdu - enabled CI for new values file.

Comment on lines +26 to +33
- name: llm-uservice
version: 1.0.0
repository: "file://../common/llm-uservice"
condition: tgi.enabled
- name: llm-ctrl-uservice
version: 1.0.0
repository: "file://../common/llm-ctrl-uservice"
condition: vllm.enabled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you're adding wrappers?

They were removed over month ago for v1.1 (#474), are unnecessary, and LLM wrapper uses a langserve component with a problematic license (opea-project/GenAIComps#264).

Comment on lines +13 to +22
For LLM inference, two more microservices will be required. We can either use [TGI](https://github.com/huggingface/text-generation-inference) or [vLLM](https://github.com/vllm-project/vllm) as our LLM backend. Depending on that, we will have following microservices as part of dependencies for ChatQnA application.

1. For using **TGI** as an inference service, following 2 microservices will be required:

- [llm-uservice](../common/llm-uservice/README.md)
- [tgi](../common/tgi/README.md)

2. For using **vLLM** as an inference service, following 2 microservices would be required:

- [llm-ctrl-uservice](../common/llm-ctrl-uservice/README.md)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, why add wrappers?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is from 1.0 release time, so with some old code.
I think it's better to merge with #610 , or just simple changes to support openvino after 610 get merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, but note that I'm testing my PR only with vLLM Gaudi version.

I.e. currently both CPU and GPU/Openvino support need to be added / tested after it.

That PR has also quite a few comment TODOs about vLLM options where some feedback would be needed / appreciated.

- [llm-ctrl-uservice](../common/llm-ctrl-uservice/README.md)
- [vllm](../common/vllm/README.md)

> **_NOTE :_** We shouldn't have both inference engine deployed. It is required to only setup either of them. To achieve this, conditional flags are added in the chart dependency. We will be switching off flag corresponding to one service and switching on the other, in order to have a proper setup of all ChatQnA dependencies.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there could not be multiple inferencing engines?

ChatQnA has 4 inferencing subservices for which it is already using 2 inferencing engines, TEI and TGI.

And I do not see why it could not use e.g. TEI for embed + rerank, TGI for guardrails, and vLLM for LLM.

Please rephrase.

Comment on lines +95 to +96
2. Please set `http_proxy`, `https_proxy` and `no_proxy` values while installing chart, if you are behind a proxy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO duplicating general information to application READMEs is not maintainable, there are too many of them. Instead you could include link to general options (helm-charts/README.md).

curl http://localhost:8888/v1/chatqna \
-X POST \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add redundant POST? -d already implies that (see man curl).


image:
repository: opea/vllm-openvino
pullPolicy: IfNotPresent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the value, it breaks CI testing for latest tag (see #587).


image:
repository: opea/llm-vllm
pullPolicy: IfNotPresent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the value, it breaks CI testing for latest tag (see #587).

openvino_enabled: true
image:
repository: opea/vllm-openvino
pullPolicy: IfNotPresent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the value, it breaks CI testing for latest tag (see #587).

Comment on lines +47 to +50
# We usually recommend not to specify default resources and to leave this as a conscious
# choice for the user. This also increases chances charts run on environments with little
# resources, such as Minikube. If you do want to specify resources, uncomment the following
# lines, adjust them as necessary, and remove the curly braces after 'resources:'.
Copy link
Contributor

@eero-t eero-t Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is obsolete. Resource request should match the actual usage (+ some space for growth), but there are some complications. See discussion in #431.

helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME}
```

```bash
# To use Gaudi device
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that there's support for both TGI and vLLM, all these comments here could state which one is used, e.g. like this:

Suggested change
# To use Gaudi device
# To use Gaudi device for TGI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants